Improved Antidictionary Based Compression

Authors

Maxime Crochemore

Gonzalo Navarro

Abstract

The compression of binary texts using antidictionaries is a novel technique based on the fact that some substrings (called “antifactors”) never appear in the text. Let sb be an antifactor, where b is its last bit. Every time s appears in the text we know that the next bit must be the complement of b, and hence we omit its representation. Since building the set of all antifactors is space-consuming at compression time, it is customary to limit the maximum length of the antifactors considered to a constant k. A larger k yields better compression of the text but requires more space at compression time. In this paper we introduce the notion of almost antifactors, which are strings that rarely appear in the text. More formally, almost antifactors are strings such that, if we treat them as antifactors and separately code their occurrences as exceptions, the compression ratio improves. We show that almost antifactors improve compression when only a limited amount of main memory is available to the compressor. Our experiments show that they obtain the same compression as the classical algorithm using only 30%–55% of its memory.
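
The classical scheme described in the abstract can be illustrated with a minimal brute-force sketch. It is not the authors' implementation: the function names (`antidictionary`, `predicted`, `compress`, `decompress`) are hypothetical, the antifactors are restricted to minimal ones of length at most k, the decoder is assumed to know the antidictionary and the text length, and almost antifactors with exceptions are not covered.

```python
def factors(text, k):
    """All substrings of text with length <= k, including the empty string."""
    return {text[i:i + n] for n in range(k + 1) for i in range(len(text) - n + 1)}


def antidictionary(text, k):
    """Minimal antifactors of length <= k: words absent from text whose
    longest proper prefix and longest proper suffix both occur in it."""
    facts = factors(text, k)
    anti = set()
    for f in facts:
        if len(f) < k:
            for b in "01":
                w = f + b
                if w not in facts and w[1:] in facts:
                    anti.add(w)
    return anti


def predicted(prefix, anti, k):
    """If some suffix s of prefix has an antifactor s+b, the next bit must be
    the complement of b; otherwise return None (no prediction)."""
    for n in range(min(len(prefix), k - 1) + 1):
        s = prefix[len(prefix) - n:]
        for b in "01":
            if s + b in anti:
                return "1" if b == "0" else "0"
    return None


def compress(text, anti, k):
    """Emit only the bits that the antidictionary cannot predict."""
    out = []
    for i, bit in enumerate(text):
        p = predicted(text[:i], anti, k)
        if p is None:
            out.append(bit)
        else:
            # Assumption: anti was built from this same text.
            assert p == bit
    return "".join(out)


def decompress(code, anti, k, n):
    """Rebuild n bits, reading from code only where no prediction exists."""
    bits = iter(code)
    out = ""
    while len(out) < n:
        p = predicted(out, anti, k)
        out += p if p is not None else next(bits)
    return out
```

For the text `0101010101` with k = 3, this sketch finds the antifactors `00` and `11`, so every bit after the first is predictable and only the first bit is emitted. The quadratic scans are for clarity only; raising k enlarges the set of factors held in memory, which is the space cost the abstract refers to.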


Similar resources

On-line Electrocardiogram Lossless Compression Using Antidictionary-Based Methods

This paper proposes on-line lossless compression of electrocardiogram (ECG) data based on antidictionaries. An antidictionary is the set of all words of minimal length that never appear in the string. The proposed methods use coders constructed from a constant-length ECG training sequence, so that they work in an on-line manner in constant computational space. Their effectiveness is demo...


A tight upper bound on the size of the antidictionary of a binary string

An antidictionary is a set of words that never appear in a binary string. Crochemore et al. (2000) presented a compression algorithm for binary text using antidictionaries, called DCA. Their coding algorithm has been tested on the Calgary Corpus, and their experimental results show that we get compression ratios equivalent to those of most common compressors such as pkzip. Recently, an onl...


Compression Using Antidictionaries

We give a new text compression scheme based on Forbidden Words ("antidictionary"). We prove that our algorithms attain the entropy for equilibrated binary sources. One of the main advantages of this approach is that it produces very fast decompressors. A second advantage is a synchronization property that is helpful to search compressed data and to parallelize the compressor. Our algorithms can ...


Pattern Matching in Text Compressed by Using Antidictionaries

In this paper we focus on the problem of compressed pattern matching for the text compression using antidictionaries, which is a new compression scheme proposed recently by Crochemore et al. (1998). We show an algorithm which preprocesses a pattern of length m and an antidictionary M in O(m^2 + ||M||) time, and then scans a compressed text of length n in O(n + r) time to find all pattern occurrences...
